Author: Paulina Tomaszewska
NOTE: There is no detailed description of the column names (an email has already been sent to the package author). The only available explanation is: "A data frame with 10000 rows and 70 variables."
import numpy as np
import pandas as pd
import pyreadr
train = pyreadr.read_r('hmc_train.Rda')['train']
valid = pyreadr.read_r('hmc_valid.Rda')['valid']
def xy_split(data, y_name="PURCHASE"):
    # Split a frame into features X and the target column y
    return data.drop([y_name], axis=1), data[y_name]
X_train, Y_train = xy_split(train)
X_test, Y_test = xy_split(valid)
X_test = X_test[X_train.columns]
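Reindexing X_test with X_train.columns guards against column-order differences between the two files. A minimal sketch of the pattern on toy frames (the frame contents here are invented for illustration):

```python
import pandas as pd

train_df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
# Same columns as train_df, but stored in a different order
test_df = pd.DataFrame({"c": [30], "a": [10], "b": [20]})

# Selecting with the training column index restores the training order
test_aligned = test_df[train_df.columns]
print(list(test_aligned.columns))  # ['a', 'b', 'c']
```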
X_train.info()
Y_train.sum()/len(Y_train)
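Y_train.sum()/len(Y_train) is the share of positive ("purchase") labels; its complement is the accuracy a constant "no purchase" predictor would reach, which is a useful baseline for the model scores below. A small sketch with invented labels:

```python
import pandas as pd

# Invented labels standing in for Y_train (0 = no purchase, 1 = purchase)
y = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

positive_share = y.sum() / len(y)
# Accuracy of a model that always predicts "no purchase"
majority_baseline = 1 - positive_share

print(positive_share)     # 0.2
print(majority_baseline)  # 0.8
```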
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
def train_xgb_model(X_train, Y_train, X_valid, Y_valid):
    xgmodel = XGBClassifier(max_depth=5,
                            learning_rate=0.05,
                            n_estimators=100,
                            objective='binary:logistic',
                            gamma=0.01)
    xgmodel.fit(X_train, Y_train, verbose=True)
    valid_score = xgmodel.score(X_valid, Y_valid)
    print("xgboost valid score {}".format(valid_score))
    return xgmodel
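XGBClassifier follows the scikit-learn classifier contract, so score here is plain accuracy on the validation set. A sketch of that equivalence with a stand-in sklearn model (LogisticRegression is used only because it is cheap to fit; the choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# .score() is mean accuracy, i.e. accuracy_score on the model's predictions
print(clf.score(X, y) == accuracy_score(y, clf.predict(X)))  # True
```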
# A deeper, faster-learning variant (defined for comparison; not used below)
def train_xgb_model1(X_train, Y_train, X_valid, Y_valid):
    xgmodel = XGBClassifier(max_depth=13,
                            learning_rate=0.15,
                            n_estimators=100,
                            objective='binary:logistic',
                            gamma=0.01)
    xgmodel.fit(X_train, Y_train, verbose=True)
    valid_score = xgmodel.score(X_valid, Y_valid)
    print("xgboost valid score {}".format(valid_score))
    return xgmodel
xgbmodel = train_xgb_model(X_train.values, Y_train, X_test.values, Y_test)
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values, feature_names=X_train.columns,
class_names=['no purchase', 'purchase'], discretize_continuous=True)
# LIME needs a prediction function returning class probabilities as floats
model_xgb_predict = lambda x: xgbmodel.predict_proba(x).astype(float)
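explain_instance expects the prediction function to map an (n_samples, n_features) array to an (n_samples, n_classes) probability array; the lambda above only ensures a float dtype. A minimal sketch of that contract (using LogisticRegression purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
predict_fn = lambda x: clf.predict_proba(x).astype(float)

probs = predict_fn(X)
print(probs.shape)  # (4, 2): one row per sample, one column per class
print(bool(np.allclose(probs.sum(axis=1), 1.0)))  # True: each row is a distribution
```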
def explain_observation(index, model_predict):
    exp = explainer.explain_instance(X_test.values[index], model_predict, num_features=5)
    exp.show_in_notebook(show_table=True, show_all=False)
    return exp.as_list()
explain_observation(2, model_xgb_predict)
The model was almost certain that the correct class is "no purchase" (the client did not buy the product). The variables with the biggest impact on this decision were:
Note: the first variable has twice as large an impact as the second.
The variables that speak for the class "purchase", on the other hand, are:
explain_observation(9, model_xgb_predict)
The model had trouble classifying this observation (each class received a probability of 0.5). The variables with the biggest impact on the decision that the correct class is "purchase" were:
The variable that speaks for the class "no purchase", on the other hand, is:
explain_observation(11, model_xgb_predict)
The model indicated "purchase" as the correct class. The variables with the biggest impact on this decision were:
The variable that speaks for the class "no purchase", on the other hand, is:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(50, 5), random_state=1)
mlp.fit(X_train, Y_train)
valid_score = mlp.score(X_test, Y_test)
print("MLP valid score {}".format(valid_score))
It seems that the model learned to predict the label "no purchase" in every case. Further hyperparameter tuning would be needed, but that is not the goal of this homework.
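One quick way to confirm such a collapse (a hypothetical check, not part of the original notebook) is to count how many distinct labels the model actually emits. The demo below uses a stand-in constant predictor rather than the fitted MLP:

```python
import numpy as np

def is_constant_predictor(predict, X):
    """True if the model emits one and the same label for every row of X."""
    return np.unique(predict(X)).size == 1

# Stand-in for the collapsed MLP: always predicts class 0 ("no purchase")
always_no_purchase = lambda X: np.zeros(len(X), dtype=int)

X_demo = np.ones((20, 5))
print(is_constant_predictor(always_no_purchase, X_demo))  # True
```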
explain_observation(2, mlp.predict_proba)
The MLPClassifier, like XGBoost, was certain that the correct class is "no purchase". The decision was based on:
The following values, in contrast, suggested that the correct label is "purchase":
explain_observation(9, mlp.predict_proba)
Where the XGBoost classifier could not decide which class was correct, the MLPClassifier pointed to the label "no purchase". Arguments for the label "no purchase":
Arguments for the label "purchase":
explain_observation(11, mlp.predict_proba)
As noted above, the model seems to predict the class "no purchase" in every case; the reported probability equals the share of class 0 in the data. Arguments for the class "no purchase":
Arguments for the class "purchase":